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Abstract 

Gaussian processes (CPs) provide a nonpara- 
mctric representation of functions. How- 
ever, classical GP inference suffers from high 
computational cost and it is difficult to de- 
sign nonstationary GP priors in practice. 
In this paper, we propose a sparse Gaus- 
sian process model, EigenGP, based on the 
Karhunen-Loeve (KL) expansion of a GP 
prior. We use the Nystrom approximation 
to obtain data dependent eigenf unct ions and 
select these eigenfunctions by evidence maxi- 
mization. This selection reduces the number 
of eigenfunctions in our model and provides 
a nonstationary covariance function. To han- 
dle nonlinear likelihoods, we develop an effi- 
cient expectation propagation (EP) inference 
algorithm, and couple it with expectation 
maximization for eigenfunction selection. Be- 
cause the eigenfunctions of a Gaussian ker- 
nel are associated with clusters of samples 

- including both the labeled and unlabeled 

- selecting relevant eigenfunctions enables 
EigenGP to conduct semi-supervised learn- 
ing. Our experimental results demonstrate 
improved predictive performance of EigenGP 
over alternative state-of-the-art sparse GP 
and semisupervised learning methods for re- 
gression, classification, and semisupervised 
classification. 

1 Introduction 

Gaussian processes (GPs) are powerful nonparametric 
Bayesian models with numerous applications in statis- 
tics and machine learning. However, we face two lim- 
itations when using GPs in practice. First, it is dif- 
ficult to construct nonstationary covariance functions 
for GPs because it is statistically and computation- 
ally challenging to parameterize positive definite co- 



variance matrices as a function of the input space. In 
practice, while nonstationary GPs have been devel- 
oped and applied to real world applications, they are 
often limited to low-dimensional problems, such as ap- 
plications in spatial statistics (Paciorek & Schervish, 
2004; Higdon et al., 1998). Second, GP inference 
is computationally challenging. Even for the regres- 
sion case where the GP prediction formula is ana- 
lytic, training the exact GP model with N points 
demands an 0(N 3 ) computational cost for inverting 
the covariance matrix. To reduce the computational 
cost, a variety of approximate sparse GP inference 
approaches have been developed (Williams & Seeger, 
2001; Csato & Opper, 2002; Snelson & Ghahramani, 
2006; Lazaro-Gredilla et al., 2010; Williams & Barber, 
1998; Qi et al., 2010) - for example, using the Nystrom 
method to approximate covariance matrices (Williams 
& Seeger, 2001) or grounding the GP on a small set of 
(blurred) basis points (Snelson & Ghahramani, 2006; 
Qi et al., 2010). An elegant unifying view for various 
approximate sparse GP regression models is given in 
Quinonero-Candela & Rasmusscn (2005). Note that 
all these sparse GP approaches gain computational ef- 
ficiency with certain approximations - possibly degen- 
erating prediction accuracy. 

In this paper, we propose a new approach, EigenGP, 
that addresses these two issues in a principled frame- 
work. Specifically, we project the GP prior into a space 
spanned by eigenfunctions, and add white noise to it 
to handle prediction uncertainty at infinity. The eigen- 
functions depend on data input so that the covariance 
changes in the input space. Furthermore, based on the 
observed data, we select a small number of eigenfunc- 
tions by maximizing model marginal likelihood. The 
projection of GPs into a small eigensubspace can re- 
move noise in function values; this is similar to what 
principle component analysis does in noise reduction, 
but we do it in a functional space for the output. Fur- 
thermore, with only a few eigenfunctions in the model, 
we can greatly reduce the computational cost; it is 
0{NL 2 ) - rather than 0(N 3 ) - where L is the number 



of the selected eigenfunctions. This selection also en- 
ables semi-supervised learning based on a commonly- 
used clustering assumption. This assumption states 
that if points are in the same cluster, they are likely 
to be of the same class. Because eigenfunctions of a 
Gaussian covariance function correspond to clusters of 
data points, we can choose clusters - based on both 
labeled and unlabeled data points - relevant for the 
predictions. 

The rest of the paper is organized as follows. Sec- 
tion 2 describes the background of GPs. Section 3 
and 4 present the EigenGP model, EP inference for 
EigenGP, and expectation maximization updates for 
sparsification. In Section 5, we discuss related works 
- in particular, the difference between EigenGP and 
the Nystrom method (Williams & Seeger, 2001) and 
relevance vector machine (Tippings, 2000). Section 
6 shows experimental results on regression, classifi- 
cation and semi-supervised classification, demonstrat- 
ing improved predictive performance of EigenGP over 
state-of-the-art approaches, including support vector 
machines, GPs and sparse GPs. 

2 Background of Gaussian processes 

We denote N independent and identically distributed 
samples as V = {(xi, j/i), . . . , (x„, y„)}jv, where Xj is 
a d dimensional input (i.e., explanatory variables) and 
Ui is a scalar output (i.e., a response), which we assume 
is the noisy realization of a latent function / at x^. 

A Gaussian process places a prior distribution over the 
latent function /. Its projection f x at {Xj}^ defines 
a joint Gaussian distribution: p(f x ) = A/"(f |m°, K), 
where, without any prior preference, the mean m° 
are set to and the covariance function fc(xj,Xj) = 
K(xj,X i j) encodes the prior notion of smoothness. A 
typical choice of k is Gaussian covariance (or kernel) 

fc (x,x0 = exp(-^=^) ! (1) 

where r\ controls the smoothness of the function. Note 
that this covariance function has the same value as 
long as I |x'— x| I remains the same - regardless where x' 
and x are. Thus this leads to a stationary GP model. 

For regression, we use a Gaussian likelihood function 

p(Vi\f)=M'(Vi\f(xi),Vv), (2) 

where v y is the observation noise. For classification, 
the data likelihood has the form 

p(Vi\f) = (1 - eM/(x,)i&) + «r(-/(xi)lfe) (3) 

where e models the labeling error and cr(-) is a cumula- 
tive distribution function (cdf) of a standard Gaussian 
(i.e., the probit model). 




Figure 1: Deep structure of EigenGP. 

Given the Gaussian process prior over / and the data 
likelihood, the exact posterior process is 

N 

K/l^y)ocGP(/|0,fc)JJp(y i |/) (4) 

i=i 

For the regression problem, the posterior process has 
an analytical form. But to make a prediction on a 
new sample, we need to invert a N by N matrix. If 
the training set is big, this matrix inversion will be too 
costly. For classification or other nonlinear problems, 
the computational cost is even higher because we do 
not have an analytical solution to the posterior pro- 
cess and the complexity of the process grows with the 
number of training samples. 

3 Model 

To obtain a nonstationary covariance function and en- 
able fast inference, EigenGP projects the GP prior in 
an eigensubspace and add a white noise Gaussian pro- 
cess #o(x) of constant variance wq so that its prediction 
uncertainty does not shrink to zero at infinity. Specif- 
ically, we set the latent function / 

L 

/(x) = X;%^(x) + o (x) (5) 

J=l 

where 0^'(x) are eigenfucntions of the GP prior. We 
assign a Gaussian prior over 8 = [61, . . . , 9£\, 6 ~ 
jV(0, diag(w)), so that / follows a GP prior with zero 
mean and the following covariance function 

L 

fc(x, x') = u^(x)0V) + w S x . x , (6) 

j'=i 

where <5 X , X ' = 1 if x is the same as x' and <5 XjX ' = 
otherwise. We choose L in (5) to be a reasonably small 
number so that we can conduct efficient inference for 
this model as shown later. 

To obtain the eigenfunctions <^'(x) of a GP prior, we 
can use the Galekin projection to approximate them by 
Hcrmitc polynomials (Marzouk & Najm, 2009). But 



for high dimensional problems, this approach requires 
a tensor product of univariate Hermite polynomials 
that dramatically increases the number of parameters. 

To avoid this problem, we use the Nystrom method 
(Williams & Seeger, 2001) that allows us to efficiently 
obtain an approximation to the eigenf unctions in a 
high dimensional space. Specifically, assuming basis 
points B = [bi,...,bg] (Q > L) are i.i.d. samples 
from the probability density p(x), we can replace 

J fc(x,x')^(x)p(x)dx = A J ^'(x) (7) 

by its Monte Carlo approximation 

1 Q 

-Y^K^MWihi)- A^'(x) (8) 

y i=l 

Then, with simple derivations, we obtain the j-th 
cigenfunction ip J (x) as follows 

^(x) = ^k(x)u J= k(x)u 3 (9) 

where k(x) = [fc(x, bi), . . . , k(x, bq)], Xj and u.j are 
the j-th eigenvalue and eigenvector of the covariance 

function evaluated at B, and Uj = ^-iij. In practice, 

j 

we often select the basis points B as a random subset 
of X. We can also first estimate p(x) based on X 
and then sample multiple B from p(x) so that we can 
obtain estimation uncertainty in <^(x) (i.e., Uj). 

Inserting (9) into (5), we obtain 

l Q 

/(x)=5^^2« -fc(x J b < ) + ffo(x) (10) 

3=1 i=l 

This equation reveals a two-layer structure of our 
model as visualized in Figure 1 (for simplicity, we do 
not shown #o( x ) hi this figure). Thus our model can be 
viewed as a deep Bayesian kernel machine. The deep 
structure highlights the difference between our model 
and the relevance vector machine (Tippings, 2000), 
which links the kernel function fcj to / directly and 
does not have the additional white noise wq- 

To learn the structure of the second layer in EigcnGP, 
we use an empirical Bayesian technique, automatic rel- 
evance determination (ARD) (MacKay, 1992), which 
prunes edges (elements of w) by maximizing model ev- 
idence (See Section 4 for more details). Since L < N 
and w is sparsified, estimating the posterior process 
of / is computationally efficient. Also, we constrain 
/ in the eigensubspace and therefore reduce noise in 
function values, which robustifies the model. If we 
have more training samples and allow a longer train- 
ing time, we can increase L, i.e., the eigensubpsace for 



our model. In this fashion, EigenGP can be viewed as 
the method of sieves (Geman & Hwang, 1982), which 
has been applied to semi-nonparametric models with 
great success (Chen, 2007). 

Note that given w and U = [ui,...,Ui], the prior 
over / is nonstationary because its covariance function 
EigenGP in (6) varies at different regions of x. This 
comes at no surprise since the eigenfunctions are tied 
with p(x) in (7). This nonstationarity reflects the fact 
that our model is adaptive to the distribution of the 
explanatory variables x. 

If we use the Gaussian kernel (1) whose eigenfunctions 
correspond to clusters of data, we can use the cluster 
assumption for semi-supervised learning. More specif- 
ically, we first use both labeled and unlabeled data to 
learn clusters in the whole dataset. Assuming that 
most data points in the same cluster share the same 
label, we can then propagate the labels of points in a 
cluster to unlabeled data points in the same cluster. 
However, when labeled data points in the same cluster 
have different signs, our ARD sparsification allows us 
to automatically choose appropriate eigenfunctions to 
accommodating the sign change in a principled way. 

3.1 An illustrative example 

Now we give a toy example in Figure 2 to illustrate 
why the selection of eigenfunctions removes noise from 
function values and leads to easy classification. We 
also visualizes the nonstationarity of our model. 

Given 300 samples from a mixture of four Gaussian 
components (See green "+" markers in the top panel 
of 2. a), wc estimate the eigenfunctions based on a 
Gaussian covariance function with the kernel width 
T) = 0.2. The 1st, 10th, 20th and 30th eigenfuntions 
are shown in Figure 2. (a). The eigenfunctions control 
the smoothness of the model. Using more eigenfunc- 
tions, we can capture more variability in function val- 
ues. However, too much modeling flexibility does not 
help prediction accuracy; in Figure 2. (a), we can see 
the 10th, 20th and 30th eigenfunctions are not useful in 
discriminating samples from two classes, represented 
by red circles and black "x" markers. 

Figure 2.(b) shows that our method identifies four 
discriminative eigenfunctions for classification; the 
weights of the selected eigenfunctions arc shown in Fig- 
ure 3. (a). The first, third and forth eigenfunctions are 
selected and they separates samples in the first, sec- 
ond and third data clusters from each other while each 
cluster contains samples with the same labels. What is 
interesting is that the second eigenfunction that cov- 
ers the forth cluster is not selected, while this cluster 
contains labels from two classes. Instead, the ninth 
cigenfunction is selected and it separates samples in 
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(a) 1st, 10th, 20th and 30th eigen- 
funtions 



(b) Selected eigenfunctions when a 
cluster having samples from two 
classes 



(c) Selected eigenfunctions when a 
cluster contains a mislabeled data- 
point 




(d) Stationary Gaussian covariance 
function based on all the eigenfunc- 
tions 



(e) Nonstationary covariance function 
based on the selected eigenfunctions 
in (b) 




(f) Nonstationary covariance function 
based on the selected eigenfunctions 
in (c) 



Figure 2: A toy example. Subfigurc (a) shows the eigenfunctions of the GP model with a Gaussian covariance. 
(b) and (c) depict the selected eigenfunctions learned with different labeled samples, (d) shows the Gaussian 
covariance function evaluated at x and x' (both ranging from -1 to 7). It has a constant value along the 
diagonal direction, (e) and (f) visualize the nonstationary covariance functions corresponding to the selected 
eigenfunctions in (b) and (c). 



the forth cluster - in other words, the samples with 
the same label correspond to the same sign in this 
eigenfunction. Therefore, based on these four selected 
eigenfunctions we can accurately classify the samples. 

In Figure 2.(c), the samples in the forth cluster belong 
to one class, except a single outlier (the red circle) . In 
this case, EigcnGP automatically selects the top four 
eigenfunctions whose weights are shown in 3.(b). Since 
the second eigenfunction covers the forth cluster with 
the same sign, EigenGP removes the impact of the out- 
lier, demonstrating the robustness of our model. This 
is similar to noise reduction in principle component 
analysis but in a functional space for the output. 

Figures 2.(d)-(f) show the values of the covariance 
functions corresponding to 2, demonstrating the non- 
stationarity of our model. (d)-(f). As shown in 2.(d), 
the stationary Gaussian covariance function in (1) has 
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(a) Based on data in Fig. (b) Based on data in Fig. 
2.(b) 2.(c) 

Figure 3: Estimated w for the classification of two data 
sets in Figure 2.(b) and (c). Only a few eigenfunctions 
are associated with nonzero weights. 



a constant value along the diagonal direction in the 
image. This is because the covariance value remains 
the same when ||x — x'|| is a constant regardless where 
the two samples x and x' are. In contrast, the covari- 
ance functions of EigcnGP, defined by 6, change their 
values based on the values of x and x' along the di- 
agonal direction as shown in (e) and (f). Note that 
the upper-right corner of (e) corresponds to the ninth 
cigenfunction (the bottom blue curve in (b)). 

4 EP based evidence maximization 

For Gaussian likelihoods (i.e., regression), we can use 
the matrix inverse lemma to efficiently compute the 
posterior process in (4) and the marginal likelihood. 
For general nonlinear likelihoods such as the probit 
model for classification, we can use expectation prop- 
agation (Minka, 2001) to map the nonlinear likelihood 
into a linear Gaussian form Af(gi\f(jx.i), r^) and ap- 
proximate the posterior process in (4) by 

N 

q(f) CX GP(f\0, k) liM(9i\f(xi),Ti). (11) 

i=l 

Let g = [gi, . . . , g N ], r = [n, . . . , t n ], and $x> (a N by 
L matrix) represent the values of the L basis functions 
at the N training points. Then combining (6) and (11) 
with equations (6.66) and (6.67) in (Bishop, 2006), we 
obtain the following proposition. 

Proposition 1 The posterior process q(f) defined in 
(11) has the mean function m(x) and covariance func- 
tion V{yi, x'): 

m(x) = k(x)Ua (12) 
V(x, x') = k(x, x') - k(x)U/3U T k(x') (13) 

where W = diag(w), (3 = W$ C T (K + w I + 
diag(x)) _1 $x'W, K = <FpW<Fp T and a = (3g. 

From this proposition, we see the additional white 
noise plays a role equivalent to the Gaussian approx- 
imation of the likelihoods so that we can absorb it in 
the likelihoods. For the probit model, this amounts to 
increasing the variance of the Gaussian in the cdf. 

To estimate (a, (3) (equivalcntly, (g,r)), the expec- 
tation propagation inference repeats the following 
three steps: message deletion, projection, and mes- 
sage update. In the message deletion step, we com- 
pute the partial belief q\' l (f;a\ l ,/3\' 1 ) by removing a 
message ti from the approximate posterior q(f;a,(3): 
q\' t (f;a\ l , f3\ l ) oc q(f; a, (3)/ti. In the projection 
step, we minimize the KL divergence between p(f) oc 
p{yi\f)q{f;a" t ,fl\' 1 ) and the new approximate poste- 
rior q(f- a, (3), such that the information from the i-th 
data point is incorporated into the model. Finally, 



the message U is updated based on the new and old 
posteriors: U oc q(f;a,f3)/q\ l (f;a\ l ,f3^ 1 ). 

4.1 Projection 

We start with the projection step, since it is the most 
crucial step of EP. From (11), we see that the poste- 
rior GP of EigcnGP defines an exponential family with 
features {f x , f x fj }, f x = (/(xi), . . . , f(x Nl )) T . There- 
fore, the minimization of the KL divergence between 
p(f) and its posterior process q(f) is achieved by mo- 
ment matching on the mean m x and the covariance 
V x of f x . The moment matching equations arc 

™" = ^' + vV < x - x ->M < 14 » 

V x = V\< + V\<(X,xOg^L^V\<(x i ,X) (15) 

where Z = J q^ l {f)p{yi\f) d/, and m^(xi) is the mean 
of q\ l (f(xi)) and V\ l (xi,X) is covariance matrix be- 
tween Xi and X. 

Given m x and V x , we want to compute a and [3. To 
do so, first note that, from Proposition 1, it follows 
that 

m x = K x3 U Q V x = K-K XjB U/?UX,b (16) 

where K = K x .bUWU t KJ b , and the element 
of the matrix K x b is k(xi, hj). 

Then combining (14) and (16), we obtain 



a = cA 1 + h 



dlogZ 
dm\ l (xi) 



(17) 



where h 4 K- 1 V\ i (X,x<) = Pi - /A l 4 , Pi 4 
diag(w)</>j and <p i = k(xj)U is precomputcd before 
the EP inference. 



Inserting (16) to (15), we obtain the update for (3 



(18) 



dm\'() 

The above equations define the projection step. 

For the classification likelihood (3) where a(-) is the 
probit function, the quantities in the projection step 
(17) and (18) are 



yW(x^Xj) 
Z = e+(1- 2e)cr(z) 



dlogZ 

dmA l (xi) 

d 2 logZ 'j{mS i {xi)yi + ?A 8 (xi)7) 



(dmV( Xl )) 2 



A*(xi) 



(19) 
(20) 
(21) 

(22) 



where 7 



(l-2e)AT(g|0,l) 



4.2 Message update 

Given the new q(f), we will update the message U(f) 
according to the ratio q(f)/q^(f)'- 

Uf)= |Vx|-* ex P H(f x - m x )V~ 1 (f x - m x )) 

\v)>\-t exp(-|(f x - m^CVx*)- 1 ^ - m,*)) 



A 

T, = - 



Af(f(xi)\g h n) 
, d 2 logZ , 
(dmV( Xt )) 2 ' 



(23) 
(24) 



d 2 logZ N _i dlogZ 



9i - mV(Xl) - WS) 2P dSR (25) 

where v^(x.j) = </>7 n an d m ^'( x i) = (pjo^ 1 . The 
function tj(/) can be viewed as a message from the 
i-th data point to q(f). 

If we want to approximate the marginal likelihood of 
the model, we need to scale J\f(f(x.i)\gi, ri) so that the 
message = s i exp(f(x i \g i ,T i )) = Zq{f)/q\ l (f) 

preserves the local "evidence" - in other words, 
/ ti(fh V (fW = I *i(/)? V (/)d/. It is easy to show 

log s = \ogZ - 1 iog(-(io g z)" n ) - ((bgZ) ? 2 

5 2 5V V 5 ) %) 2(logZ)" 

where (logZ)' and (logZ)" are the first- and the 
second- order derivatives of \o%Z over m\ l (xj). 

4.3 Message deletion 

To delete a message U(f), we need to compute <r z (/) oc 
q(f)/ti(f). Instead of computing this ratio directly, 
we can cquivalcntly multiply its reciprocal with the 
current q(f). Then we can solve the multiplication 
by minimizing KL(q(f)\\q\ l (/%(/)) over<7 V (/). Since 
q(f), q\ l {f), and ii(/) all have the form of the expo- 
nential family, the minimal value of this KL-divergence 
is 0, so that q^(f) oc q(f)/U(f). This KL minimiza- 
tion can be easily done by the moment matching equa- 
tions similar to (17) and (18): 



dm(xi) 



(dm(x,)) 2 



where 



h\< 4 Pi - /ty. 
1 

oc 7V(u i |Pi r m x ' i , -Tj + w(xi)) 



Q V (fW 



d 2 \ogZ d 
dm(xj) 2 

dlogZ d = d 2 \ogZ 
dm(xj) dm(xj) 



T 4 - v(Xi) 

2 (m(xf) 



Note that we compute ra(x,) = </>7 a an d u(xj) = 
^(diag(w) - fifr = cf>Jh\\ 

4.4 Eigenfunction selection 

EigenGP uses a small subset of cigcnfunctions. But 
unlike principle component analysis, we can improve 
the modeling power by not simply picking the top few 
eigenfunctions, as illustrated in Figure 2.(b) where 
the ninth, instead of the second, eigenfunction is se- 
lected. To select relevant eigenfunctions, we maximize 
the model marginal likelihood obtained from expecta- 
tion progagation: 

p(y|w,X)= /V(fx|0 I K)IJf i (/)dk 

i 



(27) 



.exp(±(a T (W-/?)- 1 a-£<7 2 MM)) 



We maximize the marginal likelihood via automatic 
relevance determination (ARD) (MacKay, 1992). To 
do so, we use an EP-EM approach. In the E-step, we 
compute q(f) via EP. In the M-step, wc maximize the 
expected likelihood Eg [logp(y, f x |w)] over w, which 
leads to the following update: 



0ij + > 



(28) 



Instead of the EM updates, we can use an active-set 
method (Faul & Tipping, 2002; Qi ct al., 2004) to ob- 
tain efficient updates of w. We can certainly explore 
other sparsification approaches, like l\ penalty (i.e, a 
Laplace prior), an elastic net penalty, or spike and slab 
priors to sparsify w. But a thorough comparison of 
various sparse priors is out of the scope of this paper. 

4.5 Computational complexity 

Since , , d }°, sZ „, is a scalar and h is a L by 1 vec- 

tor, it takes 0(L 2 ) to update (3 via (18). Similarly, it 
takes 0(L 2 ) to update s - 1 via (26). Therefore, given 
Ni labeled training points, the computational cost is 
0(L 2 Ni) per EP iteration over all the data points. 
Since in practice the number of EP iterations is small 
(e.g., 10), the overall cost is 0{L 2 Ni). Because of us- 
ing the active-set method, the computation cost will 
be further significantly reduced to 0(r 2 Ni) where r is 
the actual number of eigenfunctions used in the EP 
iterations and r < L. In addition, we have the cost of 
computing the top L eigenvectors U of the Q by Q ker- 
nel matrix Kb ; using efficient iterative algorithms such 
as the Lanzcos method, it takes 0(Q 2 L). In summary, 
the complexity of the EP inference is 0(Q 2 L + N[L 2 ). 



5 Related work 

Our work is built upon the seminal work by Williams 

6 Seeger (2001). But they differ in three critical as- 
pects. First, we define a valid probabilistic model 
based on an eigcn-decomposition of the GP prior. By 
contrast, the previous approach by Williams & Seeger 
(2001) aims at a low-rank approximation to the finite 
covariance/kerncl matrix used in GP training - purely 
from a numerical approximation perspective - and its 
predictive distribution is not well-formed in a proba- 
bilistic framework (e.g., it may give a negative variance 
of the predictive distribution). Accordingly, both pre- 
dictive means and variances of these two methods are 
different. Second, while the previous Nystrom method 
simply uses the first few eigenvectors, we maximize the 
model marginal likelihood to select eigenfunctions and 
adjust their weights, learning a nonstationary covari- 
ance function. Third, exploring the clustering prop- 
erty of the eigenfunctions of the Gaussian kernel, our 
approach can conduct semi-supervised learning, while 
the previous one cannot. 

Our model also bears similarity to relevance vector ma- 
chine (RVM) (Tippings, 2000) and sparse spectrum 
Gaussian process (SSGP) by Lazaro-Gredilla et al. 
(2010). But ours differs from these methods in four 
aspects. First, based on the K-L expansion of a co- 
variance function, the basis functionsn of ours is quite 
different those used in RVM or SSGP. Second, with 
the white noise f3 in the model, ours gives nonzero 
prediction uncertainty when a test sample is far from 
the training samples. In contrast, for this case the 
prediction of RVM shrinks to zero. Third, Figure 1 
shows that our model has a two-layer structure, while 
RVM and SSGP do not. Note that while we estimate 
the parameters u and w (or equivalcntly 9) associated 
with the two layers, our first layer training does not 
depend on the label y - in this sense, it echos the idea 
of unsupervised first layer training used in deep learn- 
ing. Forth, our method can conduct semi-supervised 
learning while RVM and SSGP cannot. 

6 Experiments 

In this section we test EigenGP on regression, classifi- 
cation, and semi-supervised classification tasks. 

6.1 Regression 

For regression, we compared EigenGP with the full 
GP and two sparse approximate GP algorithms: the 
Nystrom method (Williams & Seeger, 2001), and the 
pseudo-input approach, also known as Fully Inde- 
pendent Training Conditional (FITC) approximation 
(Snelson & Ghahramani, 2006). We used the Gaus- 
sian covariance (1) for all the competing methods. We 



4.4 
4.2 
4 

LU 3.8 
* 3.6 
3.4 
3.2 




-Full-GP 
FITC 

-EigenGP 
EigenGP* 



50 100 150 

Number of Basis/Rank 



(a) Boston Housing 



200 



1.7 
1.6 
1.5 

tu 1 4- 
(fi '- H 

i 1.3 
1.2 

1.1 - 
1 - 
0.9- 



— Full-GP 
FITC 

---Nystrom 

— EigenGP 
EigenGP' 



50 100 150 

Number of Basis/Rank 



(b) Pumadyn-8nm 



200 



Figure 4: Regression results on Boston Housing and 
Pumadyn-8nm. The results of the Nystrom method 
on Boston Housing arc not included in the figure, since 
their errors are much larger than the others. 

optimized the kernel width rj for the full GP and used 
it for the Nystrom method and EigenGP for a sim- 
ple fair comparison. For the FITC, we used the opti- 
mization code from http : //www . gatsby . ucl . ac . uk/ 
~snelson/SPGP_dist .tgz to learn the kernel width 
for each dimension and the basis points. Because the 
optimization is sensitive to local optima (as observed 
by (Snelson & Ghahramani, 2006) and (Qi et al., 
2010)), we tried different initial values to improve the 
results of FITC and presented here the best results we 
obtained (which are better than those of FITC using 
the kernel parameters learned from the full GP). For 
EigenGP, we set wo = 0.1 to have small white noise. 

We compared the prediction accuracies of these alter- 
native methods with the same computational complex- 
ity. Let Qi and L be the number of basis points and 
the total number of eigenfunctions of EigenGP and 
the Nystrom method, and Qi be the number of ba- 
sis points of FITC. The computational complexities of 
EigenGP and the Nystrom method is 0(Q\L + NL 2 ), 
while FITC takes (^(iVQ^)- To make these algorithms 



have the same cost, we required 0(Q 2 L + NL 2 ) = 
0(NQ 2 ). To achieve this, we could set L = Q 2 and 
Qi is smaller than yWQ^. In our experiments, we 
simply set Qi = Q2 as well (This gave FITC more 
training time because of N > Q-i). 

The results on two benchmark datasets, Boston Hous- 
ing and Pumadyn-8nm, are reported in Figure 4. 
These two datasets contain 506 and 8192 data points. 
From these two datasets, we randomly selected 400 
and 2000 points for training, respectively. And we used 
the rest for testing. We repeated these experiments 10 
times and reported the average root mean square er- 
rors (RMSE) as well as the standard errors in Figure 4. 
For EigcnGP, we not only ran the version with sparsifi- 
cation over w as described before, but also tested a ver- 
sion of EigenGP that simply chose the top L eigenfunc- 
tions. We denote the later version as EigenGP*. The 
RMSE of the Nystrom method on Boston housing are 
312.5, 41.68, 15.17 and 6.480 with 50, 100, 150 and 200 
basis points, respectively. The corresponding standard 
errors arc 70.45, 11.23, 6.194 and 1.502. Despite the 
similarity between the two versions of EigenGPs and 
the previous Nystrom method, EigenGPs significantly 
outperformed the Nystrom method on Boston Hous- 
ing consistently and on Pumadyn-8nm when the num- 
ber of basis points is small the identical experimen- 
tal setting - such as the same hyperparamcters and 
the basis points between EigenGP and the Nystrom 
method - in our experiments (because EigenGP* did 
not select eigenfunctions and estimate w via ARD, it 
even used the same weights as the Nystrom method). 
This shows that when the number of basis points is 
small, the Nystrom method suffers severely as the 
quality of the numerical approximation of the covari- 
ance matrix degenerates, while by contrast EigenGP 
and EigenGP*, as valid probabilistic models, degrade 
their performance smoothly. 

6.2 Supervised classification 

For supervised classification, we compared EigenGP 
with the full GP and three sparse GP algorithms: 
the Nystrom method, Sparse Online Gaussian Pro- 
cesses (SOGP) (Csato & Opper, 2002), and FITC ap- 
proximations (Snclson & Ghahramani, 2006; Naish- 
Guzman & Holden, 2008). We used the Gaus- 
sian covariance function for all these algorithms and 
tuned their kernel width by cross-validation. For the 
Nystrom method, we used Laplace's method to ap- 
proximate the posterior distribution as described in 
(Williams & Seeger, 2001). A better comparison with 
the Nystrom method would be using EP, instead of 
the less effective Laplace's approximation. However, 
there is no previous work that combines EP with the 
Nystrom method and the development of this algo- 



rithm is out of the scope of this paper. For the FITC 
model, we used EP for the posterior approximation 
as proposed by Naish-Guzman & Holden (2008). Al- 
though we did not maximize the model evidence to 
learn the kernel parameters for the FITC model (thus 
we may not obtain their best results), we found in 
practice that our extensive cross-validation of kernel 
parameters often gave prediction accuracies at least 
comparable to those based on evidence maximization. 

We tested these algorithms on two benchmark 
datasets: Spambase and USPS. We randomly split the 
Spambase datasct into 2300 training and 2300 test 
samples 10 times and ran all the competing meth- 
ods on each partition. For each partition, we chose 
the centers from K-means clustering of a randomly se- 
lected subset of the whole dataset as basis points for 
all the sparse GP methods. For the USPS dataset, we 
conducted three digit classification tasks: 8 vs 9, 3 vs 
8, and 5 vs 8. Each digit in USPS was treated as a 
256 dimensional vector. We randomly selected 1200 
training and 1000 test samples and repeated the par- 
tition 10 times. In USPS, we used the same procedure 
to select basis points when the number of basis points 
is smaller than 500; when it is bigger than 500, we 
selected the basis points randomly. 

As shown in Figure 5. (a), by sparsifying w, EigenGP 
consistently outperforms EigenGP* (which does not 
sparsify w) on the classification of digits 8 and 9. 
This improvement verifies the benefit of selecting rele- 
vant eigenfunctions for classification. Figures 5.(b)-(d) 
demonstrate that, with the same or less computational 
cost, EigenGP achieved lower classification error rates 
than the other sparse GP algorithms. Interestingly, 
EigenGP even outperforms the full GP. Two possible 
reasons are i) that by eigenfunctions selection, we ob- 
tain a nonstationary GP model that can better reflect 
the local smoothness of the latent function and remove 
labeling noise just like PCA for noise reduction, and 
ii) that the clustering property of eigenfunctions helps 
improve classification accuracy. 

6.3 Semi-supervised classification 

We compared EigenGP with three well-known semi- 
supervised learning algorithms: the graph rcgulariza- 
tion (GR) approach (Zhou ct al., 2004), the Lapla- 
cian support vector machine (LAPSVM) (Belkin et al., 
2006), and the sparse eigenfunction bases approach 
(SEB) (Sinha & Belkin, 2010). We did not include 
EigenGP* here because, without eigenfunction selec- 
tion, it is not designed for semisupervised learning. 
For semi-supervised learning, we used both the labeled 
and unlabeled samples as basis points to generate the 
eigenfunctions for EigenGP. We also tested supervised 
SVM as a baseline for comparison. For all these al- 
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Figure 5: Classification results on Spambase and USPS. The results are averaged on 10 random splits of the data 
and the error bars represent the standard errors. 
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Figure 6: Semi-supervised classification results on Diabetes, Ionosphere, TDT2, and 20 Newsgroups. The results 
of SVM on Ionosphere are not reported here since they are much worse than the others. 



gorithms, we used the Gaussian covariance function. 
The same kernel width was tuned by cross-validation. 

We used two UCI datasets, Diabetes and Ionosphere, 
and two text datasets, TDT2 and 20 Newsgroups. 
Diabetes and Ionosphere contain 768 and 351 sam- 
ples, respectively. For the TDT2 dataset, we se- 
lected two biggest categories for classification; for 
the 20 Newsgroup dataset, we chose two categories, 
comp.sys.ibm.pc. hardware and rec. sport. baseball, in 
our comparison. After the selection, we obtained 
1976 and 3672 documents, respectively, from these two 
datasets. We then represented each document by the 
tf-idf term weights of the most frequent 1000 words. 

We varied the number of randomly selected labeled 
samples and repeated the experiments 10 times. Fig- 
ure 6 shows the averaged prediction error rates and the 
standard errors; clearly, EigenGP consistently outper- 
formed the alternative approaches. 

7 Conclusions 

We have presented a new GP model, EigenGP, which 
selects eigenfunctions learned from data. Experiments 
demonstrated EigenGP 's superior performance for re- 
gression, classification and semisupervised classifica- 
tion on several benchmark datasets. What is a little 
surprising is that, with lower computational cost, for 



classification tasks, it can outperform the full GP that 
uses infinite eigenfunctions. We believe the improve- 
ment in both training speed and prediction accuracy 
comes from the selection of relevant eigenfunctions and 
the nonstationarity in the covariance function associ- 
ated with this selection. 

We have used cross-validation to tune the kernel width 
of the covariance function. A future work is to learn 
the hyperparameters automatically from data. Al- 
though addressing this issue is out of the scope of this 
paper, we expect learning hyparameter for EigenGP 
is feasible by adopting a sampling method, for exam- 
ple, the slice sampling method proposed by Murray & 
Adams (2011). 

Finally, we want to point out that the eigenfunctions 
in EigenGP can be viewed as dictionary elements. By 
sharing the dictionary elements across related tasks, 
EigenGP can be easily extended for multi-task learn- 
ing. 
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